
fix(prometheus_scrape source): scrape endpoints in parallel #17660

Conversation

wjordan
Contributor

@wjordan wjordan commented Jun 9, 2023

Fixes #17659.

@wjordan wjordan requested a review from a team June 9, 2023 23:20
@netlify

netlify bot commented Jun 9, 2023

Deploy Preview for vector-project ready!

🔨 Latest commit: 5557211
🔍 Latest deploy log: https://app.netlify.com/sites/vector-project/deploys/6483b3cee5520000089afbf8
😎 Deploy Preview: https://deploy-preview-17660--vector-project.netlify.app

@netlify

netlify bot commented Jun 9, 2023

Deploy Preview for vrl-playground canceled.

🔨 Latest commit: 5557211
🔍 Latest deploy log: https://app.netlify.com/sites/vrl-playground/deploys/6483b3ce15b52800088c2daf

@github-actions github-actions bot added the domain: sources (Anything related to Vector's sources) label on Jun 9, 2023
@wjordan
Contributor Author

wjordan commented Jun 12, 2023

This still needs some additional work:

  1. (blocking issue) Pending requests can start to pile up on the heap, causing unbounded memory growth. Scrape requests should implement a timeout shorter than the interval duration to prevent this (see the sketch after this list).
    • Alternatively, skip future requests for the same endpoint if a previous scrape request still hasn't completed.
  2. Each request should spawn in a separate short-lived task to spread out the request-processing load across many threads.
    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.
  3. The timing of endpoint requests should be distributed across the scrape interval instead of all executing at the same time, to spread out the scrape-request load more evenly.
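
A minimal sketch of the timeout approach in item 1, assuming a hypothetical `scrape_one` helper for the HTTP request (this is not this PR's actual code, just the shape of the fix):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical helper: performs one HTTP scrape of `endpoint` and returns the body.
async fn scrape_one(endpoint: &str) -> Result<String, Box<dyn std::error::Error + Send + Sync>> {
    // ... issue the HTTP request and read the response body here ...
    Ok(format!("metrics from {endpoint}"))
}

/// Bound each scrape by a timeout shorter than the scrape interval so a slow
/// endpoint cannot accumulate pending requests across ticks.
async fn scrape_with_timeout(endpoint: &str, scrape_interval: Duration) -> Option<String> {
    // Leave some headroom below the interval (here: 90% of it).
    let budget = scrape_interval.mul_f64(0.9);
    match timeout(budget, scrape_one(endpoint)).await {
        Ok(Ok(body)) => Some(body),
        Ok(Err(err)) => {
            eprintln!("scrape of {endpoint} failed: {err}");
            None
        }
        Err(_elapsed) => {
            eprintln!("scrape of {endpoint} timed out after {budget:?}");
            None
        }
    }
}
```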

@jszwedko
Member

jszwedko commented Jun 12, 2023

Thanks for opening this!

  • (blocking issue) Pending requests can start to pile up on the heap, causing unbounded memory growth. Scrape requests should implement a timeout shorter than the interval duration to prevent this.
    • Alternatively, skip future requests for the same endpoint if a previous scrape request still hasn't completed.

This seems like a separate issue, no? Or is that behavior introduced by this PR?

  • Each request should spawn in a separate short-lived task to spread out the request-processing load across many threads.
    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.

I'm not sure I follow this one. The normal pattern would be to just execute the tasks async and let tokio manage scheduling. I think that's what is happening here?

  • The timing of endpoint requests should be distributed across the scrape interval instead of all executing at the same time, to spread out the scrape-request load more evenly.

This feels like a nice-to-have. I wouldn't consider it blocking this PR.
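
For what it's worth, one way to spread the requests out would be to give each endpoint's loop a start offset within the interval. A rough sketch (the function and names are illustrative, not this PR's code):

```rust
use std::time::Duration;
use tokio::time::{interval_at, Instant, MissedTickBehavior};

/// Start each endpoint's scrape loop at a different offset within the scrape
/// interval so the requests are spread out instead of all firing on one tick.
fn spawn_staggered(endpoints: Vec<String>, scrape_interval: Duration) {
    let n = endpoints.len().max(1) as u32;
    for (i, endpoint) in endpoints.into_iter().enumerate() {
        // Evenly distribute the start offsets across one interval.
        let offset = scrape_interval * (i as u32) / n;
        tokio::spawn(async move {
            let mut ticker = interval_at(Instant::now() + offset, scrape_interval);
            ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);
            loop {
                ticker.tick().await;
                // ... perform the scrape of `endpoint` here ...
                println!("scraping {endpoint}");
            }
        });
    }
}
```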

@jszwedko jszwedko requested a review from StephenWakely June 26, 2023 15:25
@StephenWakely
Contributor

This seems like a separate issue, no? Or is that behavior introduced by this PR?

So before, if you have a slow Prometheus endpoint, the single request will wait until it is finished. This does mean that if, say, you were scraping every 15 seconds, you may actually only get metrics every 30 seconds. Technically there is some data loss. Memory usage remains constant.

With this PR, the call to scrape occurs every 15 seconds regardless of how slow the endpoint is. The result is that if the scrape takes longer than 15 seconds, the request is placed in a queue. This queue can get longer and longer, and memory usage will creep up. Presumably Vector will eventually be killed.

I had to push things fairly hard to get a significant increase in memory. 10 endpoints scraped every second with a 20-second lag time for each request was using about 5 GB after 30 minutes or so. So I'm not sure how much of a problem this would be in the real world.

Ideally we should implement the workarounds suggested by @wjordan. We just need to decide whether we should do that before merging this or after.
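
As a rough illustration of the skip-if-still-running workaround (a sketch with hypothetical names, not what either PR implements): a per-endpoint in-flight flag lets each tick either start a scrape or skip, so at most one request per endpoint is ever pending and the queue cannot grow.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::time::Duration;

/// Called on every scrape tick for one endpoint. If the previous scrape of
/// that endpoint is still in flight, the tick is skipped, so at most one
/// request per endpoint is pending at any time and memory stays bounded.
fn scrape_if_idle(endpoint: String, in_flight: Arc<AtomicBool>) {
    // Try to claim the in-flight slot; if it is already taken, skip this tick.
    if in_flight
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        println!("previous scrape of {endpoint} still running, skipping tick");
        return;
    }
    tokio::spawn(async move {
        // ... perform the HTTP scrape of `endpoint` here; a slow endpoint only delays itself ...
        tokio::time::sleep(Duration::from_millis(10)).await; // placeholder for the request
        println!("finished scraping {endpoint}");
        in_flight.store(false, Ordering::Release);
    });
}
```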

@spencergilbert
Contributor

Ideally we should implement the workarounds suggested by @wjordan. We just need to decide whether we should do that before merging this or after.

I'd say before merging; I doubt we'd find the time to prioritize that unless/until someone reports it as a bug, which isn't a great user experience.

@nullren
Contributor

nullren commented Jul 19, 2023

@wjordan I couldn't push to your branch, but I added timeouts here: #18021. I'll make the timeouts configurable tomorrow 👍

One thing I noticed was that with lots of endpoints there was a bit of "set up" time where all the futures get created first and then run. So, for example, with around 20 endpoints it took about 7 seconds after the first "tick" started for the actual requests to go out. Not particularly problematic, but it was unexpected and not something I fully understand 🤔

Edit: fixed ☝️ in my PR

@wjordan
Contributor Author

wjordan commented Jul 20, 2023

@nullren has been making further progress on this in #18021, so I'm closing this PR in favor of that one and adding some follow-up to the previous discussion here:

  • Each request should spawn in a separate short-lived task to spread out the request-processing load across many threads.
    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.

I'm not sure I follow this one. The normal pattern would be to just execute the tasks async and let tokio manage scheduling. I think that's what is happening here?

My Rust / Tokio knowledge is limited, so I could be completely mistaken on this one. As I understand it, the current setup runs the scrapes async but not in separate tasks, so tokio does the network I/O in parallel but can't run the request-processing load (HTTP parsing, event enrichment, etc.) across many threads unless each client-request future is also wrapped in tokio::spawn. It's possible the task-spawning and context-switching overhead is much greater than the actual request-processing load anyway, and it may need profiling to know for sure; in any case it's a small optimization detail at best and probably not terribly important.
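
To make the distinction concrete, here is a sketch (not this PR's code; `scrape_and_parse` is a hypothetical stand-in): joining the futures keeps all of the response parsing on the single task that polls them, while spawning each scrape lets the runtime schedule that work across its worker threads.

```rust
use tokio::task::JoinSet;

// Hypothetical stand-in for one HTTP scrape plus response parsing.
async fn scrape_and_parse(endpoint: String) -> usize {
    // ... request, body parsing, event enrichment ...
    endpoint.len()
}

async fn concurrent_on_one_task(endpoints: Vec<String>) {
    // The futures run concurrently, but they are all polled by this one task,
    // so their CPU-bound parsing runs on one worker thread at a time.
    let results = futures::future::join_all(endpoints.into_iter().map(scrape_and_parse)).await;
    println!("parsed {} responses on a single task", results.len());
}

async fn spawned_per_endpoint(endpoints: Vec<String>) {
    // Each scrape is its own task, so the runtime can run the parsing of
    // different endpoints on different worker threads in parallel.
    let mut set = JoinSet::new();
    for endpoint in endpoints {
        set.spawn(scrape_and_parse(endpoint));
    }
    while let Some(result) = set.join_next().await {
        let _parsed = result.expect("scrape task panicked");
    }
}
```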

I had to push things fairly hard to get a significant increase in memory. 10 endpoints scraped every second with a 20 second lag time for each request was using about 5gb after 30 minutes or so. So I'm not sure how much of a problem this would be in the real world.

For reference, the real-world use cases both @nullren and I are working with involve thousands of endpoints; I had to kill Vector within seconds in my local testing.

@wjordan
Contributor Author

wjordan commented Jul 20, 2023

Closing in favor of #18021

@wjordan wjordan closed this Jul 20, 2023
github-merge-queue bot pushed a commit that referenced this pull request Jul 24, 2023
…timeouts (#18021)

fixes #14087 
fixes #14132 
fixes #17659

- [x] make target timeout configurable

this builds on what @wjordan did in
#17660

### what's changed
- prometheus scrapes happen concurrently
- requests to targets can time out
- the timeout can be configured (user-facing change)
- small change in how the HTTP client was instantiated

---------

Co-authored-by: Doug Smith <dsmith3197@users.noreply.github.com>
Co-authored-by: Stephen Wakely <stephen@lisp.space>
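
For reference, the user-facing side of "the timeout can be configured" would look roughly like the following; the struct and field names here are illustrative assumptions, not necessarily what #18021 shipped:

```rust
use std::time::Duration;
use serde::Deserialize;

/// Illustrative config fragment for a per-source scrape timeout.
#[derive(Debug, Deserialize)]
struct PrometheusScrapeConfig {
    /// The endpoints to scrape.
    endpoints: Vec<String>,
    /// Maximum time, in seconds, to wait for any single scrape request.
    #[serde(default = "default_scrape_timeout_secs")]
    scrape_timeout_secs: u64,
}

fn default_scrape_timeout_secs() -> u64 {
    5
}

impl PrometheusScrapeConfig {
    fn scrape_timeout(&self) -> Duration {
        Duration::from_secs(self.scrape_timeout_secs)
    }
}
```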